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Abstract 

Background: Molecular data, e.g. arising from microarray technology, is often used for predicting survival 
probabilities of patients. For nnultivariate risk prediction models on such high-dimensional data, there are established 
techniques that combine parameter estimation and variable selection. One big challenge is to incorporate 
interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and 
incorporating interactions terms in high-dimensional time-to-event settings, especially for settings in which it is 
computationally too expensive to check all possible interactions. 

Results: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting 
interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation 
study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity 
in different kinds of scenarios. The main challenge are interactions composed of variables that do not represent main 
effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of 
interactions frequently may not be large enough to improve prediction performance, even though the interactions 
are potentially of biological relevance. 

Conclusion: Screening interactions through random forests is feasible and useful, when one is interested in finding 
relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of 
interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study 
emphasizes the importance of making full use of existing methods in addition to establishing new ones. 

Keywords: Boosting, High-dimensional data. Model selection. Model complexity. Prediction error curves. Random 
forest. Time to event settings 



Background 

As more and more high-dimensional molecular data is 
amassed, the importance of biomarker research increases. 
Specifically, predictive biomarkers are usually wanted in 
order to predict risks associated with diseases. When 
building multivariate risk prediction models for finding 
such biomarkers, it is desirable to produce sparse models. 
The sparsity of the resulting models facilitates the biologi- 
cal and statistical interpretation [1-4]. Approaches such as 
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componentwise boosting [5] or the LASSO [6-8] achieve 
sparsity by performing variable selection and parame- 
ter estimation simultaneously. There are two frequently 
occurring problems in this context: first, lack of repro- 
ducibility of variable selections across different studies, for 
example concerning gene expression data [9-11]; second, 
no established approaches to account for interactions. The 
latter deficit can lead to selection of wrong variables or 
biased parameter estimations. The first problem, i.e. the 
inability to confirm most of the published gene related sig- 
natures, has led to doubts whether signatures should be 
produced at all. However, the failure of finding stable sig- 
natures could to some extent be ascribed to inadequate 
modeling. Approaches that are more comprehensive are 
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necessary, for example, combining molecular data with 
annotation and clinical information [9,12-15]. One ingre- 
dient should be to incorporate promising interactions in 
the model Many tools for modeling interactions exist, 
but, as far as we know, no systematic investigations of 
potential building blocks are available. 

Examples for promising modeling strategies that 
can account for interactions are penalized regression 
models [16,17], logic regression [18,19], multifactor- 
dimensionality reduction [20,21] or random forests 
[22,23]. For a comprehensive review regarding interac- 
tion (pre-) selection approaches, we refer to [24]. Logic 
regression and multifactor-dimensionality reduction are 
primarily destined for discrete marker data, e.g., for single 
nucleotide polymorphism data. In contrast to that, penal- 
ized or regularized regression models cover more general 
types of data. Their main property is to put a penalty 
on the model parameters, which correspond to marker 
effects, for estimation. The usage, for example, of an Li- 
penalization forces most of the estimated parameters to 
be zero, i.e., the values of the corresponding covariates do 
not influence predictions obtained from the fitted model. 
Even though these models are primarily used for main 
effect selections, there is an increasing interest in incor- 
porating interactions [25-27]. When there is no a priori 
knowledge, such approaches either require the interac- 
tions to be formed by variables that represent main effects 
or that interaction terms are created by combining the 
covariates in a certain way, e.g., by producing all dis- 
tinct two-way interactions (or by coarsening the input 
space before producing the interactions [28]). The first 
route can lead to false negatives even if the true interac- 
tions have relevant marginal effects, and the second one 
neglects the fact that it is frequently either not feasible 
or computationally too expensive to consider all possi- 
ble interactions. Altogether, this means that a screening 
method for promising interaction terms is in most cases 
necessary, especially for higher order interactions. The 
potential of random forests to provide non-parametrical 
means for handling various kinds of interaction struc- 
tures makes them attractive as an interaction screening 
method for penalized regression models. However, apart 
from some interesting theoretical results (see [29,30]) and 
positive empirical findings regarding prediction perfor- 
mances (e.g., [31-33]), the ability to extract information 
from random forests is considered problematic. The main 
objection is that established variable importance mea- 
sures seem to be unable to detect relevant interaction 
effects in the absence of strong marginal components 
[34-36]. 

Variable importance measures (VIMs) for random 
forests are meant to extract the information contained in 
forests. Established VIMs are the Giniy the permutation 
accuracy, or the minimal depth importance (see [37-39]). 



The first measure uses the mean improvements in the 
Gini index in a forest related to the investigated variable. 
Permutation accuracy importance measures the change 
in prediction accuracy of the forest when the values of 
a variable are permutated randomly, and minimal depth 
importance is roughly related to the mean minimum dis- 
tance (the depth) from the root node to the investigated 
variable. These measures can also be used for finding 
interactions in the forest. For example, the permutation 
accuracy importance can easily be extended such that the 
values of two variables are permutated randomly [37]. 
These variable importance measures lead to a ranking of 
variables, in which interaction information is assumed to 
enter in some way. Whether these interactions are statis- 
tically relevant can be evaluated by penalized regression 
models. Hence, a comprehensive evaluation can consist of 
two parts: extracting interaction terms based on random 
forest information and estimating a statistical regression 
model based on all available variables and identified inter- 
action terms. 

In this paper, we show building blocks for evaluating 
and incorporating interactions terms in high-dimensional 
time-to-event settings, in particular for settings in which 
it is very computationally expensive to check all pos- 
sible interactions with an exhaustive search algorithm. 
The main ingredients are random survival forests (RSF), 
a specific adaptation of random forests to time-to-event 
settings, and an incremental stagewise forward regres- 
sion technique, called CoxBoost [40-42]. CoxBoost is a 
boosting technique based on the Cox proportional haz- 
ards model and combines variable selection with model 
estimation. For this purpose, it uses a penalized version 
of the partial log-likelihood and applies componentwise 
boosting. We investigate the effect of a combination of 
these approaches, the additional contribution of resam- 
pling, and the advantage of a special data pre-processing 
step. This work is a feasibility study; hence, we are first 
of all interested in investigating how several components 
can contribute to the solution of the interaction finding 
problem. The specific choice of the investigated tools is 
justified by their specific properties; however, there are 
alternatives to our decisions (see above). We are interested 
in predicting risks within time-to-event settings and we 
use methods established in these settings. In this context, 
we rely on some assumptions, such as the proportional 
hazard assumption. 

In the next section, we present details of CoxBoost and 
RSF together with corresponding VIMs. After presenting 
evaluation tools and our interaction detection strategy, 
we outline a simulation design for the evaluation. In the 
Results Section, the findings of the simulation study are 
shown, and we illustrate our approach on two real-world 
applications. Finally, we describe limitations of the study 
and summarize our findings in the Conclusion Section. 
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Methods 

Time-to-event or survival data for n investigated entities 
is typically given as a set of triples Zi = {fu ^i)^ i = 
1, . . . , The first component is the observed time for each 
entity / and is given by ti = min(r^, Q), where Ti is the 
event time and Q is the censoring time from which on 
the entity is no longer observed. The second component 
is the event indicator <5^, which takes the value 1 if an event 
has occurred at the observed time {Ti < Q) and 0 if 
the event time is censored (Ti > Ci), The third element, 
Xif IS the vector of values of the p covariates observed at 
baseline. 

CoxBoost 

As a forward stagewise regression technique in the time- 
to-event setting, we use a likelihood-based boosting vari- 
ant, called CoxBoost [15,43]. This technique is based on 
the Cox proportional hazards model, which relates the 
hazard A.(^|x/), i.e. the instantaneous risk of having an 
event at time given the covariate information in Xi, for 
entity /, in the following way: 

k(t\xi) = Xo(t) expixfp), 

where the baseline hazard Xo(t) is left unspecified. Usu- 
ally, the parameter vector p = (y^i, . . . , Pp)-^ is estimated 
by maximizing the partial log-likelihood (PLL): 



with indicator function / (see also [44]). However, such a 
procedure is not feasible for p > n. Therefore, CoxBoost 
uses a penalized version of the PLL and applies compo- 
nentwise likelihood-based boosting [40,41,45]. Conven- 
tional CoxBoost starts with parameter estimates ^^^^ = 
(0, . . . , 0)^. In each boosting step k = 1, . . . only 
one coefficient is updated. In order to determine which 
component /* should be updated in step the penal- 
ized univariate PLL with argument Oj^\j e {1, . . . is 
considered: 

with fixed penalty parameter p > 0 and the variable 
parameter Oj^\ In PLL (^^y^^^^^ all parameter components 
with indices unequal to ; are set to the corresponding 
components of The parameter vector compo- 

nent ;* is the one that leads to the maximum value of 

VLLpen (^^/^^^' Instead of maximizing the penalized PLL 
for each candidate /, using the standard Newton-Raphson 



algorithm, the penalized score statistic can be used as a 
criterion 



where Uj^"^ is the value of the score function U(0) = 
d?LL(e)/de for e = O^''^ = O, and 1^ is the value of the 
Fisher information 1(0) = d^VLL(0)/dO^, again for 0 = 
0^^^ = 0. The covariate /* with the largest value of the 
score statistic is selected for an update of the form: 



while 



p(k) 



aik-l) 



— for all covariates j ^ The tuning 

parameter p is typically set to J]^- 5/ • — 1), with v e (0, 1] 
as the relative step size factor. The number of boosting 
update steps can be determined by a cross-validation 
procedure. 

One salient feature of this forward stagewise regression 
technique is that it inherently avoids 'breaking up a large 
main effect coefficient into a sum of smaller pieces' in con- 
trast to, for example, non-boosted regression models with 
L2-penalization (see [16]). In addition to that, CoxBoost 
has many extensions. It is, for example, possible to force 
the inclusion of a number of covariates into the model 
by suspending penalization for them [15]. This is rele- 
vant for settings with few clinical covariates and a large 
number of molecular variables. In this case, the coefficient 
estimates of the mandatory covariates are updated before 
the other covariates. Further, more than one coefficient 
can be updated in each boosting step, or the penaliza- 
tion parameter can vary from step to step. CoxBoost 
and all these features are implemented as an R-package, 
correspondingly called CoxBoost [46]. 

Random forests 

Random forests are ensembles of - usually binary - clas- 
sification or regression trees [22] . Usually unpruned trees 
are generated based on resamples of the original data and 
a random component in the splitting procedure, which 
implies that in every knot splitting is based on the num- 
ber mtry of randomly selected variables. Unpruned trees 
in the context of random forests are rather unproblem- 
atic in terms of overfitting on training data; however, 
they can have detriment effects on the consistency of the 
response estimations [29,47]. Each path in such gener- 
ated trees represents a sequence of splits that leads to 
the response of cases corresponding to that path. The 
final model response is determined by aggregation, e.g. 
averaging the responses of a case over all trees. 

Random forests can detect and deal with small effects, 
interactions and non-linear associations, making no as- 
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sumptions about the corresponding functional form [48] . 
All of these characteristics are also valid for trees. How- 
ever, one important rationale behind random forests is 
the de-correlation of information that is represented in 
single trees, which reduces the corresponding variances - 
a bagging phenomenon [30,49] - and the grouping prop- 
erty of trees. The latter property relates to the fact that 
a split on a variable from a cluster of correlated variables 
is frequently followed by splits of other members of that 
group [50]. A further advantage of forests over trees is 
that they can approximate smooth functions without the 
necessity of having a large number of leaves in a tree, 
due to the smoothing effect of the bagging phenomenon 
[51]. Random forests perform relatively well off the shelf 
[52] with the default- values for mtry {=^) and for the 
number of trees in a forest (=1000). 

As one specific adaptation of random forests to right- 
censored time-to-event data, we consider random survival 
forests (RSF) [53]. For a - computationally expensive - 
alternative, see party. The response for RSF is the 
cumulative hazard function (CHF), defining an ensemble 
predicted value with respect to mortality'. For splitting, 
typically the Logrank test is used [54]. Hence, the homo- 
geneity of nodes in the tree is a result of maximizing the 
difference of event probabilities between daughter nodes. 
For each entity in the data set, the ensemble CHF is cal- 
culated by averaging the Nelson-Aalen estimator of all 
leaves, into which the entity drops [53,55]. For a terminal 
node h with N(h) distinct event times ti^h < h^h < • • • < 
tN(h),h> this estimator is given as 

where di^^ and Y/^/^ are the number of deaths and entities 
at risks at time ti^h. RSF are implemented in the R-package 
randomSurvivalForest [56]. 

Variable importance measures for random forests 

Various variable importance measures ( VIMs) can be used 
for selecting variables. There are two well-known VIMs: 
Gini importance and permutation accuracy importance 
(PAM). Another VIM is the mean minimal-depth mea- 
sure, which has been proposed recently. Roughly, it mea- 
sures the shortest distance (depth) from the root node to 
the parent node of the maximal subtree (the largest sub- 
tree whose root node splits with respect to the variable 
investigated). For further details, we refer to [50]. Differ- 
ent VIMs can produce different rankings; for example, 
the Gini importance was found to be highly affected by 
selection bias, e.g., continuous variables are preferred to 
categorical variables with only few categories [38] . In the 
following, we focus on PAM, because it is widely accepted 



and relates to the concept of simulating a null distri- 
bution (necessary for computing p-values), even though 
we are aware of potential problems [50,57]. For further 
information regarding VIMs, we refer to [58] and [59]. 

There are two versions of PAM. In its common version it 
is computed with respect to random permutations of the 
components oixj = {xy, . . . , Xnj)'^ , which breaks the asso- 
ciation oixj with the response and all variables. In a more 
sophisticated variant, which is unique to RSF with respect 
to survival data [50], the vector Xi = {xn, . . .,Xip) related 
to entity / is dropped down in all trees, in which it was out 
of bag in the training process; whenever a split node for 
an investigated variable is encountered, the correspond- 
ing vector Xi is randomly assigned to one of the daughter 
nodes. In both variants, the variable importance results 
from the prediction error of the altered forest minus the 
prediction error of the non-altered forest. The larger the 
importance values of a variable, the higher its value for 
prediction. It is important to notice that PAM is tied to the 
error measure used. One frequently used error measure 
for RSF, which we use here as well, is Harrells concor- 
dance index, which measures the discrimination ability of 
a model [60]. 

Tools for finding effects in time-to-event data 

In high-dimensional settings, the problems of extract- 
ing relevant information by regression models are aggra- 
vated compared to the low-dimensional counterparts. For 
example, even if stepwise regression introduces biases 
related to multiple test problems (see, for example, 
[61,62]), it nevertheless provides a means for tackling vari- 
able selection issues in a comprehensive manner. It is 
therefore crucial to investigate mechanisms and measures 
for an adequate model selection on high-dimensional 
data. Three issues have to be addressed simultaneously: 
(1) a sparse variable selection, (2) representing the rele- 
vant structure in the data, and (3) good prediction per- 
formance. We try to tackle these issues and in particular 
concentrate on integrating substantial interactions into 
the model. 

The likelihood-based boosting algorithm promises 
sparse and stable variable selection, which is a conse- 
quence of simultaneous selection and estimation in a 
multivariable model. Naturally, variable selection stability 
also depends on the quality of the data (see, for exam- 
ple, [63]), and for obtaining high-quality molecular data 
frequently appropriate pre-processing steps are necessary, 
e.g., background correction and normalization. Concern- 
ing the other two issues (representing the relevant struc- 
ture in the data and good prediction performance) Yang 
[64] strikingly demonstrates that best predictive models 
usually contain irrelevant features and important features 
often do not lead to best prediction performances (see also 
[65,66]). Whenever we encounter the trade-off between 
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relevance and usefulness for prediction, we prioritize 
'finding relevant variables' over prediction performance. 

The models are evaluated within a resample procedure 
for estimating sensitivity and stability. As a performance 
measure adapted for time-to-event endpoints, we use the 
Brier score [67,68]. The Brier score is a strictly proper 
scoring rule, i.e. it is optimal only at the true probability 
model (see [69]). For example, the area under the curve 
(AUG) is not a strictly proper rule, because it can lead 
to optimal values for different probability models (slight 
changes of probabilities often do not matter). Two com- 
mon resampling techniques are cross-validation (C V) and 
bootstrapping. Cross-validation partitions the data into 
folds and evaluates prediction performance on every sin- 
gle fold with models fitted to the data from the remaining 
folds; a more precise characterization for CV is therefore 
subsample technique'. Both techniques can cause prob- 
lems (see [38,70]), and we decided to use subsampling 
with splits of relative size 0.632 to (1-0.632), because this 
seems to work well in many settings [71,72]. Such a sub- 
sampling procedure is roughly comparable to a 3-fold CV 
(see [73,74]). 

The Brier score quantifies the squared deviation 
between predicted survival probability and observed sur- 
vival status and is independent from the assumed survival 
model. When Hq is the estimated cumulative baseline haz- 
ard at baseline and denotes the estimated coefficients, 
the predicted survival probability is given by 

7t(t,x) = I — exp (^—Hoit) exp(x^ P)^ 

and the expected Brier score tracked over time (i.e., the 
expected prediction error curve) has the form 

Err(t;7t) := Ex [{8(t) - 7t(t,x)f^ , 

where 8(t) is the true survival status at time t Typically 
the survival status at time t will be right censored for 
some observations. Thus, inverse probability of censoring 
weights (IPCW) were proposed to avoid the related bias 
[68,75]. The IPCW for individual / is defined as 

P(ti - \Xi) P{t\Xi) 

where P{s\xi) is a consistent estimate of probability that 
the censoring time is larger than 5, given Xi. /(•) is again 
the indicator function. The cross-validation estimate of 
the Brier score tracked over time is then 

_ 1^1 



Here, B is the number of resamples, n the number of 
rows, and the indices of those cases that are included 
in the resample b. 

Assembling of building blocks into an interaction 
detection strategy 

Our comprehensive strategy consists of three parts: (a) 
first main effect detection, (b) pre- selection of inter- 
actions terms, (c) final model selection. Parts (a) and 
(c) use CoxBoost and are fixed. Here, we rely on the 
ability of CoxBoost to produce sparse models and to 
include important variables. In part (b), we consider 
the following building blocks: (BBl) subsampling, (BB2) 
random forests, and (BB3) orthogonalization as a data 
pre-processing step. Different decisions concerning the 
building blocks lead to flexibility in part (b). When 
combining building blocks into comprehensive strate- 
gies, over-fitting to the data at hand and over-optimism 
could occur [76] . One way to account for that - besides 
the usage of independent validation data sets - is to 
evaluate the contribution of the building blocks to the 
results. 

The use of an outer subsampling for interaction find- 
ing has the aim of enhancing the credibility of interaction 
information. Specifically, we use the variable inclusion 
frequency (VIFs), i.e. the proportion of times that the 
variable appeared in the model, for assessing the rele- 
vance of an interaction term. For example, when using 
random forests, this means that the number of random 
forests in which interaction terms are deemed relevant 
is the basis for a pre-selection of interactions. Here, an 
interaction term is assessed as relevant if both underlying 
variables have PAM values larger than zero in a ran- 
dom forest (typically, there are many variables with PAM 
values < 0). In other words, variables have to be simulta- 
neously important for a random forest. 

When all building blocks are used for the pre-selection 
of interactions terms, random forests are applied to 
the data in a subsampling context and orthogonaliza- 
tion is used as a data-pre-processing step. Orthogonal- 
ization means that all variables not considered as main 
effects are made orthogonal to those that are indicated 
as main effects by CoxBoost in the first step. This leads 
to disentanglement of information, which might allow to 
determine variables and related interactions that contain 
information that was originally masked by main effects 
(a similar idea is employed in [27]). The strategy using 
all building blocks is described by the following pseudo- 
algorithm (rsf-VIF-res): 

1. Specify: Indices /C of clinical covariates or other 
known main effect variables, number S of subsamples 
for pre-selecting interactions, and number of pre- 
selected interaction terms. In case of identical VIF 
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values for the Rth and {R + l)th found interaction, all 
interactions with that VIF value are included as well 
2. Subsample the original data set Z in relation 0.632 to 
(1-0.632), leading to the data sets and Z^/. 

(a) First pass main effects detection: Run 

CoxBoost on Z^, possibly incorporating 
clinical covariates {xf^k e /C} without 
penalization. This leads to the model 
CoxBoostM and a list of main effects, given 
by the index set Ai. Main effects and clinical 
covariates are used for orthogonalization (if 
this pre-processing step is considered) in the 
pre-selection step and as unpenalized 
variables in the final CoxBoost model. 

(b) Pre-selection of interaction terms: If 
MU ICis non-empty, regress all covariates 
with indices {1, \ (A4 U /C) on the 
variables in (M. U /C). Subsequently, compute 
the corresponding residuals of the covariates, 
which leads to the data matrix Z^ (building 
block (BB3)). Subsample S times data from 
Zh - from Zh, when M is empty - in relation 
of 0.632 to (1-0.632) and generate RSF on 
each larger subsample (building blocks (BBl) 
and (BB2)). Construct interaction terms by 
all pairs of variables with PAM values greater 
0 on every subsample and compute VIFs of 
the interactions terms at the end of the 
subsampling process. Select the R most 
frequent pairs. 

(c) Final model: covariates are x/^^ k e IC, 

Xi, i e and the R selected cross product 
terms of (b). Run CoxBoost on Z/, with these 
covariates without penalization for covariates 
with indices in /C, leading to model c^fin. 

(d) Compute prediction error: Apply cZ?fin on 
Z^ and compute the Brier score. 

For assessing the contribution of building blocks, we 
successively remove one of them in the pre-selection step, 
leading to following alternatives to step (b): 

bl Do the same as in rsf-VIF-res, but without 
orthogonalization. (rsf-VIF) 

b2 Replace rsf in rsf-VIF by CoxBoost: subsample S times 
data from Zy in relation 0.632 to (1-0.632) and run 
CoxBoost on each of the larger data sets. Finally, 
compute VIFs related to the variables selected by 
CoxBoost in each subsample, and create pairs, i.e. 
interaction terms, related to the variables with the 
highest VIFs. (cb-VIF) 

b3 Omit subsampling in cb-VIF: compute all distinct 
cross product terms of covariates with indices in M. 
Here, S is superfluous and R is not needed, if 



I VTMTI < R) otherwise, select randomly R 
interactions from all cross product terms, (cb-crossp) 

There are many more alternatives, which are not consid- 
ered due to a limited space and for reasons of clarity. 

Simulation design 

For a systematic analysis of the building blocks and the 
corresponding interaction detection strategies, a time-to- 
event simulation study was conducted. Here, we define 
interactions as effects based on multiplicative combina- 
tions of variables. The main interest concerns the ability 
of the strategies to find relevant interactions and espe- 
cially those that might be difficult to detect, i.e., variables 
in interactions are not members of the set of true main 
effects. The secondary focus is on the prediction perfor- 
mance, which highly depends on the effect sizes of main 
effect variables and interaction terms. 

The simulation scenarios are designed to mimic simple 
yet realistic settings, e.g. microarray studies. We simulate 
independent as well as correlated data for a time-to-event 
end point. Table 1 summarizes the scenarios and shows 
the effect sizes of the main effects and interactions. The 
number of covariates is fixed as 1000 (=/?), the sample size 
is fixed as 150 (=n), and all covariates are from a stan- 
dard normal distribution (except for Sim22_bin with 4 
binary variables and the scenarios with correlated data). 



Table 1 The effect sizes of non-zero effects in each scenario 


Scenarios: 


Effect size ME 


Effect size Int 


Corr value Block 








size 


Sim42 


(3,3,-3,-3) 


(5,-5) 




Sim22_1.0 


(0.9, -0.9) 


(1.0,-1.0) 




Sim22_0.5 


(0.9, -0.9) 


(0.5, -0.5) 




Sim22_0.25 


(0.9,-0.9) 


(0.25,-0.25) 




Sim22_1.5 


(0.9,-0.9) 


(1.5,-1.5) 




Sim22_2.0 


(0.9, -0.9) 


(2.0,-2.0) 




Sim22_2.5 


(0.9, -0.9) 


(2.5,-2.5) 




Sim22_bin 


(0.9, -0.9) 


(1.0,-1.0) 




Sim22_corr01 


(0.9,-0.9) 


(1.0,-1.0) 


0.1 5 


Sim22_corr03 


(0.9, -0.9) 


(1.0,-1.0) 


0.3 5 


Sinn22_corr05 


(0.9, -0.9) 


(1.0,-1.0) 


0.5 5 


Sinn22_corr07 


(0.9, -0.9) 


(1.0,-1.0) 


0.7 5 



ME: main effect. Int: interaction. Corr: correlation. In all scenarios, the samples 
size is 1 50 {=n) and the number of covariates is 1 000 (=p). The effect sizes are 
given in the form '(coefficient value of effect 1 , coefficient value of effect 2)'. For 
the scenarios with correlations, p divided by the block size (200) gives the 
dimension of the normal distribution from which the values of the variables in a 
block are sampled, and the correlation value is the value at the off-diagonals of 
the corresponding covariance matrix. In scenarios Sim42 and Sim22_1 .0, all 
interaction detection strategies are covered. All other scenarios are used for 
investigating rsf-VIF-res. 
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The covariates not indicated in the table have zero effect 
sizes. For each simulation scenario, 50 datasets are gen- 
erated. Survival times and censoring times are generated 
from an exponential distribution with baseline hazard A = 
^ (see also [77]). 

Sim42 refers to the simulation case with 4 main effects 
and 2 interactions that are composed of the main effects 
and Sim22_:v to the cases with 2 main effects and 2 inter- 
actions that are not related to these main effects; in other 
words: they are composed of variables that have zero 
effect sizes. If x is numeric, it gives the uniform effect 
size; x = "bin" denotes the case of interactions composed 
of binary variables, and x beginning with "corr" relates 
to cases with variables that are block-correlated with a 
uniform correlation coefficient c, c G {0.1,0.3,0.5,0.7}, 
across the 200 fiver-blocks, i.e., values of variables are 
sampled from a 5-dimensional normal distribution with 
the same variance matrix (the same correlation value at 
the off-diagonals and variance of 1) over all blocks. Here, 
main effects and variables in interactions terms stem from 
different blocks. 

In scenarios Sim42 and Sim22_1.0, all interaction detec- 
tion strategies are considered for evaluating the effect 
of the building blocks. The other scenarios are used 
to investigate the behavior and the limits of rsf-VIF-res. 
The simple scenario Sim42 is used for ascertaining that 
the strategies are capable of finding the relevant main 
effects. Scenario Sim22_1.0 is the reference scenario for 
the scenarios with non-smooth interactions, i.e. inter- 
actions incorporating binary covariates, and correlated 
variables. 

The performance of a strategy is measured by the num- 
ber of correct non-zero variables in the models, i.e. the 
variable selection sensitivity with respect to the main 
effects and the interaction terms, and by the prediction 
error (Brier score). Specificity values or predictive val- 
ues are not separately listed in the result tables. However, 
these measures can be deduced from the sensitivity values 
and the number of selected variables. 

In order to obtain one interpretable measure for the pre- 
diction performance, the Brier scores tracked over time 
are aggregated by computing the integrated prediction 
error curves (IPECs) for each model. Furthermore, the 
IPECs of the estimated models (IPECs-) are considered 
relative to the IPEC of the corresponding Kaplan-Meier 
estimator (IPECkm)^ 



rIPEC, := 



IPECkm - IPEC5, 
IPECkm 



In other words, rIPEC gives the relative improvement 
of prediction performance of strategy Si compared to the 
prediction performance of the Kaplan-Meier. 



Results 

Simulation study 

The simulation was conducted in R-3.0.2 with following 
main settings for the model implementations used in our 
strategies. Parameters not listed are considered secondary 
and were set to their default values: 

CoxBoost 

penalty: (Number of events) -(q^ ~ !)♦ Penalty value for 
the updates in each boosting step, 
standardize: TRUE. Covariates are standardized, 
stepno: As computed by cv.CoxBoost. Number of boost- 
ing steps. 

RSF 

mtry: Square root of the number of variables (default 
value). 

ntree: 1000. Number of trees grown (default value). 

The parameter values of the strategies described in the 
Methods Section were chosen in the following way: no 
clinical covariates, hence JC = {}; the number of sub- 
samples {S) for pre-selecting interactions was 50, and the 
number of pre-selected interaction terms (R) was 10000. 
For all scenarios, data were are randomly generated 50 
times. The results for scenarios Sim42 and Sim22_1.0 are 
given in Table 2. The relevant columns are: the number 
of selected interactions by the corresponding screening 
method (IntScreen), the number of total variables in the 
final model ( VarsTotal), the sensitivity with respect to the 
inclusion of true main effects (MainSensi), the sensitivity 
with respect to the availability of true interactions from 
the screening step (IntSensiA), the sensitivity with respect 
to the inclusion of true interactions in the final model 
(IntSensi), and the rIPEC values of CoxBoostM and the 
final model. For simplifying the discussion of the results, 
we will abbreviate the phrase 'random forests together 
with PAM' by 'random forests' or 'RSF! 

In scenario Sim42, use of random forests generate mod- 
els with more than 30 variables in the mean (about 
twice the number seen Sim22_1.0), whereas the other two 
strategies result in less than 20 variables on average. This 
means that there are many false positive findings when 
using RSF, which has a negative impact on the rIPEC com- 
pared to the cb-strategies. On the other hand, rsf-VIF-res 
leads to the largest sensitivity values for main effects and 
interactions. Hence, there is a trade-off between sensitiv- 
ity and prediction performance. For all pre-selection vari- 
ants it seem that when true interaction are pre-selected 
(see IntSensiA), then almost all of them are selected 
in the final model. Comparing cb-crossp with cb-VIF, 
we see that subsampling can increase IntSensi without 
decreasing MainSensi, and this leads to the best rIPEC 
value in scenario Sim42. Use of random forest instead of 
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Table 2 Results of the simulation study for all strategies in scenarios Sim42 and Sim22_1 .0 

IntScreen IntSensiA VarsTotal MainSensi IntSensi rIPEC 

CoxBoostM Final Model 

Scenario Sim42 



(cb-crossp) 
(cb-VIF) 
(rsf-VIF) 
(rsf-VIF-res) 



213.78 
1573.06 
24557.74 
25740.32 



0.7 (0.05) 
0.88 (0.03) 
0.95 (0.02) 
0.97 (0.02) 



14.3 (6.29) 
1 9.42 (7.89) 
34.74 (7.97) 
32.02 (6.56) 



0.845 (0.04) 
0.845 (0.04) 
0.82 (0.04) 
0.845(0.05) 



0.7 (0.05) 
0.88 (0.03) 
0.94 (0.02) 
0.97 (0.02) 



0.12(0.12) 
I 
I 

0.12(0.122) 



0.4 (0.19) 
0.43 (0.15) 
0.37(0.15) 
0.4 (0.12) 



Scenario Sim22_1 .0 

(cb-crossp) 89.56 0(0) 17.28(8.18) 0.95(0.02) 0(0) 0.12(0.09) 

(cb-VIF) 2058.5 0(0) 19.12(12.19) 0.87(0.03) 0(0) | 

(rsf-VIF) 17511.96 0.07(0.01) 12.64(10.85) 0.73(0.07) 0.06(0.02) | 

(rsf-VIF-res) 17701.72 0.4(0.05) 19.9(12.93) 0.81(0.04) 0.39(0.05) 0.12(0.09) 



0.09 (0.13) 
0.09 (0.1) 
0.08 (0.08) 

0.14(0.14) 



IntScreen (given as mean) is the number of selected interactions by the corresponding screening method; IntSensiA (given as 'sensitivity value (sd)') is the sensitivity 
related to the availability of true interactions; VarsTotal (given as 'mean (sd)') is the number of total variables in the final model; MainSensi (given as 'sensitivity value 
(sd)') is the sensitivity related to the inclusion of true main effects; and IntSensi (given as 'sensitivity value (sd)') is the sensitivity related to the inclusion of true 
interactions. The rIPEC values (given as 'mean (sd)') are shown for CoxBoostM and the final model. The scenarios were repeated 50 times. Additional file 1 : Figure SI 
provides boxplots for further insights into the nature of the variability in rIPEC. 



CoxBoost for interaction pre-selection (rsf-VIF) increases 
IntSensi but leads to a slight reduction of MainSensi. 
The MainSensi and IntSensi values of rsf-VIF-res indicate 
that orthogonalization is not only important for further 
increasing IntSensi but also for a higher MainSensi value 
compared to rsf-VIF. Overall, in this scenario, subsam- 
pling is important and random forests should be applied 
on orthogonalized data for achieving the largest sensitiv- 
ity values but even then, prediction performance cannot 
be improved compared to interaction pre-selection with 
CoxBoost. 

Scenario Sim22_1.0 exhibits some differences to Sim42. 
First, all strategies lead to similar and moderate num- 
bers of total variables in the final model. Hence, high 
IntScreen values in RSF strategies do not result in more 
false positives than interaction pre-selection variants that 
use CoxBoost. CoxBoost is not able to pre-select true 
interactions, with or without subsampling. Subsampling 
even reduces MainSensi values. Use of random forest fur- 
ther decreases MainSensi with a little compensation of 
increased IntSensi but at an interchange rate that makes 
a further reduction of rIPEC possible. Again, RSF has 
to be applied to the pre-processed data (rsf-VIF-res) for 
increasing MainSensi and IntSensi compared to rsf-VIF. 
Now, the increase in IntSensi is drastic, which leads to 
the best rIPEC value in this scenario. The IntSensi value 
is still moderate; however, one should bear in mind that 
the interactions are built by variables that do not represent 
main effects. This might be particularly relevant for real 
world applications: even a moderate variable inclusion fre- 
quency of an interaction term could indicate an important 
interaction if the underlying variables are irrelevant as 
main effects. 



Both scenarios show that all building blocks are impor- 
tant, and in particular orthogonalization is important 
before applying RSF, i.e. disentangling information before- 
hand is crucial for pre-selecting interactions. CoxBoost is 
unlikely to benefit from such a pre-processing because it 
already applies some sort of orthogonalization during fit- 
ting (further experiments also point in that direction; data 
not shown). 

That IntSensiA is often similar to IntSensi in both sce- 
narios means that one can rely on the ability of CoxBoost 
to choose the right interaction terms out of those pre- 
sented, regardless of IntScreen. Hence, it seems that the 
parameter R (number of selected interactions) can be 
quite high. For assessing the effect of R, we additionally 
investigated the same scenarios rsf-sVIF-res with 7? = 1000 
(see Additional file 1: Table SI). With this reduced R- 
value, sensitivities, VarsTotal, and rIPEC decreased; the 
latter two measures were in particular reduced in scenario 
Sim22_1.0. Thus, in case of doubt, R should be set to a 
larger value. 

We also investigated the behavior of the parameter esti- 
mates of the main effects and the interaction terms. In no 
case did a true effect receive a wrong sign. In the mean, 
shrinkage of the coefficients was stronger in scenario 
Sim42 than in Sim22_1.0. This has two reasons: higher 
absolute values of the true coefficients and increased 
number of non-zero coefficients. As the results show, this 
increased shrinkage is not relevant for the sensitivity of 
the detection strategies. One can try to reduce shrink- 
age by reducing the value of the penalty parameter or 
manually increasing the number of step sizes; however, 
we would not recommend such intervention in a high- 
dimensional setting without good reasons (see below). 
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Table 3 Results of the simulation study for rsf-VIF-res in scenarios Sim22_0.25 - Sim22_2.5 and Sim_bin 

Scenario IntScreen IntSensiA VarsTotal MainSensi IntSensi rIPEC 

CoxBoostM Final Model 



Strategy rsf-VIF-res 



Sim22_2.5 


15231.62 


0.3 (0.05) 


16.04(14.72) 


0.29 (0.05) 


0.3 (0.05) 


0 (0.06) 


0.1 (0.16) 


Sim22_2.0 


1 7200.98 


0.33 (0.05) 


18.28(14.41) 


0.51 (0.05) 


0.33 (0.05) 


0.01 (0.06) 


0.14(0.19) 


Sim22_1.5 


16878.62 


0.34 (0.05) 


20.38(14.57) 


0.67 (0.05) 


0.34 (0.05) 


0.05 (0.08) 


0.12(0.15) 


Sim22_1.0 


17701.72 


0.4 (0.05) 


19.9(12.93) 


0.81 (0.04) 


0.39 (0.05) 


0.12(0.09) 


0.14(0.14) 


Sim22_0.5 


18613.08 


0.43 (0.05) 


18.04(10.92) 


0.93 (0.03) 


0.13(0.03) 


0.21 (0.1) 


0.17(0.1) 


Sim22_0.25 


20404.02 


0.46 (0.05) 


20.4(11.63) 


0.98 (0.01) 


0(0) 


0.25 (0.11) 


0.2 (0.11) 


Sinn22_bin 


22847.06 


0.42 (0.05) 


26.22(10.9) 


0.49(0.07) 


0.34(0.05) 


0.2 (0.07) 


0.15(0.08) 



IntScreen (given as mean) is tlie number of selected interactions by the corresponding screening method; IntSensiA (given as 'sensitivity value (sd)') is the sensitivity 
related to the availability of true interactions; VarsTotal (given as 'mean (sd)') is the number of total variables in the final model; MainSensi (given as 'sensitivity value 
(sd)') is the sensitivity related to the inclusion of true main effects; and IntSensi (given as 'sensitivity value (sd)') is the sensitivity related to the inclusion of true 
interactions. The rIPEC values (given as 'mean (sd)') are shown for CoxBoostM and the final model. The scenarios were repeated 50 times. Additional file 1 : Figure S2 
provides boxplots for further insights into the nature of the variability in rIPEC. 



Table 3 shows the results for rsf-VIF-res in scenarios 
Sim22_0.25 - Sim22_2.5 and Sim_bin. The former scenar- 
ios show that with reduced effects of the interaction terms 
MainSensi increases. In other words, the final model 
ceases to find true main effects, if effect sizes of inter- 
actions terms are larger than the effect size of the main 
effects. In scenarios Sim22_e, e > 1.0, the rIPEC values are 
larger than that of CoxBoostM due to an increase both of 
MainSensi and IntSensi. Reducing the effect sizes of inter- 
actions below 1.0 seems to make them unimportant for 
the prediction performance of the final model. This leads 
to a further increase of MainSensi values and decreasing 
IntSensi values, causing rIPEC values of CoxBoostM to be 
larger than those of the final model. Nevertheless, Inter- 
SensiA has the largest value in Sim22_0.25, which means 
that random forests were frequently able to preselect the 
true interactions, even if the effect sizes of the interaction 
terms are small. Scenario Sim_bin exhibit another inter- 
esting feature: IntSensiA is larger and IntSensi smaller 
compared to scenario Sim22_1.0. This means that non- 
smooth interactions are found slightly better by the ran- 
dom forest, and yet CoxBoost was not able to select 
them all. The large sum of both sensitivity MainSensi and 
IntSensi values does not lead to an improvement of rIPEC 
compared to CoxBoostM, which probably is a result of a 
larger number of false positives in the final model. 

In Table 4 results of the scenarios with correlations are 
given. These scenarios are challenging, because random 
forests can have problems in distinguishing between cor- 
relation and interactions. The problem is dealt with exten- 
sively in [57]. There is a debate whether correlations are 
pointing to relevant associations or not (see [78-80]). Our 
focus here is on the effect of non-informative correlations 
on MainSensi and IntSensi. The results show that even 
small correlation values lead to decreased sensitivities 



compared to Sim22_1.0. Both, CoxBoost and random for- 
est are negatively but not overly affected by correlations. 
However, with correlations > 0.5 IntSensi and Main- 
Sensi decrease excessively. InterSensiA values of 0.15 and 
InterSensi values of about 0.05 suggest that most of this 
deterioration can be ascribed to CoxBoost and not to the 
random forests. The rIPEC does more or less reflect the 
tendency of reduced sensitivities. In summary, correla- 
tions do pose a problem for rsf-VIF-res, but mainly because 
of the inability of CoxBoost to select the true effects and 
not because of the random forests component. This is 
corroborated by the fact (data not shown) that true inter- 
actions were almost never replaced by interaction terms 
built by variables that correlate with variables in the true 
interactions. 

Real data illustrations 

Diffuse large-B-cell lymphoma data 

In order to illustrate how rsf-VIF-res can be applied on 
real data, we first analyzed the well-known Rosenwald 
data [81]. This data set was used to link 7399 (lym- 
phochip' cDNA microarray) gene expression features of 
240 patients with diffuse large-B-cell lymphoma (DLBCL) 
to the time of their death. DLBCL is an aggressive malig- 
nancy of mature B lymphocytes with a high rate of remis- 
sions. The objective of the Rosenwald study was to devise 
a molecular profile that accounts for the underlying het- 
erogeneity, predicts survival and can be used for assessing 
the effect of the related therapies. Overall, 138 deaths were 
observed, with a five year overall survival of 48%. The 7399 
features measured at baseline represent 4128 genes. An 
established clinical predictor, the International Prognos- 
tic Index (IPI - a combination of five clinical features), is 
available for n = 222 patients, which will be considered 
for the analysis in the following, i.e. /C = {Indices (IPI)}. 
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Table 4 Results of the simulation study for rsf-VIF-res in scenarios with correlated variables 





IntScreen 


IntSensiA 


VarsTotal 


MainSensi 


IntSensi 


rIPEC 
















CoxBoostM 


Final Model 


Sim22_corr01 


19779.06 


0.22 (0.04) 


17.76(13.25) 


0.75 (0.04) 


0.21 (0.04) 


0.11 (0.07) 


0.1 (0.11) 


Sim22_corr03 


17961.56 


0.23 (0.04) 


12.74(12.85) 


0.46 (0.05) 


0.18(0.04) 


0.06 (0.07) 


0.03 (0.08) 


Sim22_corr05 


15953.22 


0.15(0.04) 


6.66 (8.6) 


0.14(0.03) 


0.06 (0.02) 


0.03 (0.06) 


0 (0.03) 


Sim22_corr07 


14174.12 


0.15(0.04) 


7.26 (8.49) 


0.06 (0.02) 


0.04 (0.02) 


-0.03 (0.2) 


0.01 (0.3) 



IntScreen (given as mean) is the number of selected interactions by the corresponding screening method; IntSensiA (given as 'sensitivity value (sd)') is the sensitivity 
related to the availability of true interactions; VarsTotal (given as 'mean (sd)') is the number of total variables in the final model; MainSensi (given as 'sensitivity value 
(sd)') is the sensitivity related to the inclusion of true main effects; and IntSensi (given as 'sensitivity value (sd)') is the sensitivity related to the inclusion of true 
interactions. The rIPEC values (given as 'mean (sd)') are shown for CoxBoostM and the final model. The scenarios were repeated 50 times. 



For further details and an overview with respect to vari- 
ous strategies for analyzing this and related data sets, we 
refer to [82]. 

We were interested in gaining new insights by incorpo- 
rating interactions together with main effects. In almost 
all previous analyses of the Rosenwald data, at least four 
genes exhibited strong main effects. Our assumption was 
that there should be relevant interactions as well. Even 
though, it is frequently reasonable to assume complex 
and non-linear interactions, using cross product terms 
should be a first step in enriching the molecular pro- 
file. rsf-VIF-res was applied with R = 10,000 and on 
50 subsamples of the original data set. The prediction 
error curves of CoxBoostM and of rsf-VIF-res are given in 
Figure 1. The mean prediction errors (bold dashed lines) 
show that rsf-VIF-res performs slightly worse than Cox- 
BoostM. Based on our assumption that there should be 



relevant interactions, the simulation study suggests that 
slight reduction of the prediction performance might still 
point to interaction effect sizes that are moderate, i.e., 
below the effect sizes of the main effects but not negligible 
(see Sim22_0.25). 

The three main effects and gene-gene-interactions, 
given in Unigene cluster notation, related to the largest 
relative VIFs (in parentheses) are: 

- Hs.184298 (0.66), Hs.99741 (0.54), Hs.85769 (0.44) 
and 

- Hs.76807:Hs.84298 (0.10), Hs.79428:Hs.l93736 
(0.08), Hs.20191:Hs.99597 (0.06) 

The underlying genes of the interactions represent no 
relevant main effects for CoxBoostM, and the VIFs of 
these interactions are low. In order to increase certainty 
concerning the interactions, we manually increased the 
step size of the final model to 500. There, the same main 
effects are associated with slightly higher VIFs (0.72, 0.6, 
0.52). The changes for the interactions are more inter- 
esting: two new interaction terms are among the inter- 
actions with the largest VIFs and the relative VIF values 
increased to 0.22, 0.16, and 0.14 for Hs.20191:Hs.79428, 
Hs.20191:Hs.28777, and Hs.76807:Hs.84298, respectively. 
From the considerable increase of relative VIFs, we con- 
cluded that these interactions might be more reliable. 
Figure 2 shows the connections between the genes in 
selected interaction terms with relative VIFs > 3/50 
(the bolder the edges, the higher the corresponding VIF 
values). Our observations in the simulation study (specif- 
ically Sim22_0.5) indicate that the most frequent interac- 
tion term Hs.20191:Hs.79428 could be relevant, although 
its frequency is moderate. However, mean model size 
increased drastically from 24 to 84 and led to an rIPEC 
value of 0, so, in order to corroborate our conclusion 
further, we went back to molecular biological informa- 
tion. From KEGG (Kyoto Encyclopedia of Genes and 
Genomes), we retrieved the pathways of the genes in 
the interaction term Hs.20191:Hs.79428. The proteins 
of these genes (SIAH and BNIP3) are elements in the 
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Figure 1 Prediction error curves on tiie Rosenwald data. Shown 
are the curves for the Kaplan-Meier estimates, CoxBoostM and 
rsf-VIF-res on all subsampled data. The bold curves are the aggregated 
curves over all subsamples. Additionally, rIPEC values are given. 
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Figure 2 Network graph for the Rosenwald data depicting the connections between those interaction that are found with relative VIF 


> 3/50. The thickness of an edge reflects the value of the corresponding VIF, i.e., the VIF value of an interaction term, the bolder the edge between 


the variables in that interaction term. For microarray features that do not correspond to a gene, the feature ID is given. Every other molecular feature 


is represented by the Unigene cluster notation. 





pathway related to (mitochondrial) apoptosis [83,84]. In 
addition to that, they have a role in the cellular response 
to hypoxia [85,86]. The expression values of both genes 
lead us to the assumption that the corresponding pro- 
teins might interact complementarily: common down/up- 
regulation with respect to the apoptotic function and 
to hypoxial-induced reactions might have an impact on 
tumor genesis and growth (see also [87,88] for further 
evidence). 

Neuroblastoma data 

A further real-world example is related to the microar- 
ray data set of Oberthuer et al [89]. It consists of n = 
276 patients suffering from neuroblastoma. Overall, 42 
deaths were observed and the median survival time is 
632 days. For each patient, p = 9, 986 microarray fea- 
tures are available, and we concentrate on the relationship 
between survival and these microarray features. The same 
parameter values as for the Rosenwald data are used but 
with no clinical covariates, i.e. /C = { }. The prediction 
error curves of CoxBoostM and of rsf-VIF-res are given in 
Figure 3. Again, the mean prediction errors (bold dashed 
lines) indicate that rsf-VIF-res performs slightly worse than 
CoxBoostM. 

The three main effects and gene-gene-interactions with 
the largest relative VIFs are: 

- Hs.496658 (0.68), Hs.491494 (0.58), 
Hs.584827(0.54) and 

- Hs.496658:Hs.l48989 (0.28), Hs.496658:Hs.371249 
(0.18), Hs.496658:Hs.532824 (0.18) 



VIFs are higher than for the Rosenwald data. Based on 
the simulation study results, the VIFs might be consid- 
ered large enough for indicating important interactions. 
Hs.496658 is the most relevant gene entity: it contributes 
the largest main effect VIF and is involved in interac- 
tions with the largest VIFs. The corresponding gene name 
is SLC25A5, and the product of this gene functions as 
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Figure 3 Prediction error curves on tiie Neuroblastoma data. 

Shown are the curves for the Kaplan-Meier estimates, CoxBoostM and 
rsf-VIF-res on all subsampled data. The bold curves are the aggregated 
curves over all subsamples. Additionally, rIPEC values are given. 
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a gated pore that translocates ADP from the mitochon- 
drial matrix into the cytoplasm. Suppressed expression 
of this gene has been shown to induce apoptosis and 
inhibit tumor growth (see the corresponding entry in 
the database of NCBI). Figure 4 shows the connections 
between the genes from interactions with relative VIFs > 
3/50. The graph is more complex than Figure 2, which 
translates into an increased uncertainty with respect to 
the relevance of the interactions. For example, Hs. 148989 
is gene CGNLl, which encodes a protein that localizes to 
both adherens and tight cell-cell junctions and mediates 
junction assembly and maintenance (see the correspond- 
ing entry in the database of NCBI). There could be a 
real interaction between both genes (e.g., when cell-cell 
junctions break loose, apoptosis cannot be induced), but 
further biological validation would definitely be necessary. 

Discussion 

From the results of the simulation study, we conclude 
that random forests can provide relevant interaction infor- 
mation. If the interaction is strong enough, the marginal 
effects of the underlying variables are at a level such that 
they are frequently selected as split variables in the ran- 
dom forest generation process. Further, the results indi- 
cate that disentangling information also is important for 



achieving good results. The reason behind this might be 
that variables associated to main effects can mask inter- 
actions in random forests, which affects the split variable 
selection process. Disentanglement of information specif- 
ically means to transform variables to be orthogonal to 
those with indices in (A4 U /C). When the number of esti- 
mated main effects in CoxBoostM is too large (rule of 
thumb: more than about jq • n [90]), the corresponding 
regressions can be unreliable. In this case, we would rec- 
ommend focusing on those main effects with the largest 
absolute coefficient estimates in CoxBoostM. Another 
possibility is to use the linear predictor Ip = ^i^j^ PiXi 
for the regression jVyX = ci • Ip/, + 6y^, y ^ A^, /c = 1, . . . , 
In both cases, the orthogonalization is imperfect and 
results (not shown) based on the latter variant indicated 
that sensitivities related to interactions are considerably 
lower than with the strategy for computing residuals pro- 
posed here. However, an alternative should be taken into 
consideration, when the number of cases is small (e.g., 
smaller than 50). 

The scenarios with correlation and non-smooth inter- 
actions show that the pre-selection of interactions is less 
affected than the final CoxBoost model. For non-smooth 
interactions, this was expected due to the non-smooth 
nature of individual trees in random forests, but the 
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Figure 4 Network graph for the Neuroblastoma data depicting the connections between those interaction that are found with relative 
VIF > 3/50. The thickness of an edge reflects the value of the corresponding VIF, i.e., theVIF value of an interaction term, the bolder the edge 
between the variables in that interaction term. The molecular features feature are represented by the Unigene cluster notation. 
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effects on correlated data indicate that the pre-selection 
of interactions in rsf-VIF-res is quite robust. One further 
interpretation of the simulation study is that moderate 
variable inclusion frequency of an interaction term (e.g., 
10% — 30%) still could indicate an important interaction. 
The real data example showed that uncertainty related 
to the reliability of the findings can make it necessary to 
consider and contextualize as much information as possi- 
ble. Specifically, increasing the step size of CoxBoost from 
its optimal value to 500 in the Rosenwald data was an 
attempt to reduce the uncertainty. Due to the consider- 
able deterioration of prediction performance, the decision 
on the importance of the identified interactions was based 
on additional biological knowledge. There is no absolute 
threshold with respect to a decrease in prediction per- 
formance that makes the results definitely unreliable. The 
results showed that detection of true interaction and main 
effects can be accompanied by deteriorated or bad predic- 
tion performances due to the increase in false positives. 
It always depends on the subject-matter question whether 
a certain level of prediction performance is deemed nec- 
essary. If biology can help sorting out the true effects, 
concerns related to prediction performance even might be 
considered secondary. 

The results showed that the number of pre-selected 
interactions R must be large enough (>> 1000 in our 
data sets) for guaranteeing that the screening process 
is able to pre-select relevant effects. CoxBoost was fre- 
quently able to select the right variables out of ten thou- 
sands of variables. This is a feature of many other (Li-) 
regularized regression techniques such as the LASSO (see 
also [91,92]), which (under sparsity assumptions) also are 
consistent for variable selection, even when the number 
of variables p is as large as exp(w") for some 0 < a < 1 
[93]. Empirically determining an optimal R is neverthe- 
less difficult. This issue certainly needs further scientific 
investigations. 

Limitations 

Due to the focus of this paper and the limited space, 
our study has several limitations. First, only two-way 
interactions were considered in the interaction screening 
process. In real- world data, all kind of multifactor and 
non-linear interactions can be expected. Second, the sim- 
ulation scenarios are limited in their scope, because we 
focused on one critical issue: the effects of the building 
blocks when interactions are built from variables that do 
not represent main effects. Although, we also investigated 
simulations scenarios with correlations, further investi- 
gations of informative correlations and more complex 
correlations structures are relevant. 

Third, the real-data applications showed that the strate- 
gies cannot be used in an automatic way. Decisions related 
to the choice of some parameter values (e.g., number of 



subsamples S or indices of unpenalized variables /C), inter- 
pretation of the results, and further processing of these 
results have to be based on subject-matter knowledge and 
the specific application. Such requirements could discour- 
age a user from using rsf-VIF-res. Nevertheless, it should 
be clear that assessing the necessity of considering inter- 
actions is not trivial, even for the simplest case of gene- 
gene interactions and therefore informed decisions are 
crucial. 

Fourth, there are open questions such as the specific 
value for R or alternatives to the building blocks presented 
in the paper. There are several routes for extending our 
proposal or replacing components in it. Fifth, we only con- 
sidered proportional hazard models and simulated data 
from such models. It is important to consider departures 
from the related assumptions in future studies, for exam- 
ple by considering time-dependent effects. Finally, the real 
data examples only considered microarray data. Recent 
sequencing approaches, such as the RNA-Seq technology, 
are gaining more and more ground and should be targeted 
as well. 

Conclusion 

Our aim in this study was to build a strategy for incorpo- 
rating two-way interactions into multivariate risk predic- 
tion models that are built on high-dimensional molecular 
data. When it is either not feasible or computation- 
ally too expensive to consider all possible interactions, 
screening is necessary in case of no a priori knowl- 
edge. We presented three important building blocks for 
such a screening strategy: subsampling, random forests, 
and orthogonalization of the data, and concluded that 
all building blocks are important. Our decision for using 
random forests for screening interactions has one main 
reason: the promise of random forests to capture var- 
ious kinds of relevant interaction structures. CoxBoost 
was used, because it usually produces sparse risk predic- 
tion models. We assumed that a combination of these 
two approaches could be fruitful due to their comple- 
mentary character. However, components can be sepa- 
rately replaced by other ones, for example random forests 
by multifactor-dimensionality reduction, and such flexi- 
bility seems necessary, because no specific combination 
of building blocks will perform well on every kind of 
data. 

The results show that screening interactions through 
random forests is feasible and useful, when one is inter- 
ested in finding relevant two-way interactions. Effect sizes 
of the interactions should be large enough in order to 
guarantee useful results. When the underlying variables 
do not represent main effects, sensitivities related to vari- 
able and interaction selection are moderate (< 40%). 
The results of the simulation study indicates that mak- 
ing all variables orthogonal to those with indices in (M U 
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/C) could enable random forests to pre-select relevant 
interaction effects even in the absence of strong marginal 
components. 

The real data applications showed that not only pre- 
processing and a combination of different tools are 
important for interaction detection but also an intelligent 
post-processing. Our final conclusion is that in addition 
to focusing on establishing new methods, it is important 
to make full use of existing ones. 
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